how-twitter-uses-redis-to-scale-105tb-ram-39mm-qps-10000-ins

Original article: how-twitter-uses-redis-to-scale-105tb-ram-39mm-qps-10000-ins

The original article covers the author's experience scaling Redis at Twitter to 105 TB of RAM, 39 million QPS, and 10,000+ instances. It mostly walks through the considerations that go into building a large-scale system, which makes it well worth a read, and it also discusses the factors to weigh when running Redis in that kind of environment.

  • Why Redis?
  • Hybrid List
  • BTree
  • Cluster Management
  • Data Insight
  • Wish List for Redis

A few good takeaways:
(Hmm, the sixth point?)

  1. Scale demands predictability.
    The larger the cluster and the more customers it serves, the more predictable and deterministic you want your service to be. When there's one customer and a problem comes up, you can dig into it and it's intriguing. When you have 70 customers you can't keep up.

  2. Tail latencies matter.
    When you fan out to a lot of shards, if even one shard is slow your entire query will be slow (a quick probability sketch is below, after this list).

  3. Deterministic configuration is operationally important.
    Twitter is moving towards a container environment. Mesos is used as the job scheduler: the scheduler fulfills the request for the amount of CPU, memory, etc., and a monitor kills any job that goes over its resource requirement. Redis causes a problem in a container environment because it introduces external fragmentation, meaning you use more memory to store the same amount of data. If you don't want to be killed you have to compensate for that with oversupply: you have to think "my memory fragmentation ratio won't go over 5%, but I'll allocate 10% more as buffer space, maybe even 20%", or "I think I'll get 5,000 connections per host, but just in case let me allocate memory for 10,000 connections". The result is a huge potential for waste (a toy sizing calculation is sketched below, after this list). Super-low-latency services don't play well with Mesos today, so these jobs are isolated from other jobs.
  4. Knowing your resource usage at runtime is really helpful.
    In a large cluster bad stuff happens. You think you are safe, but things happen and behaviour is unexpected. Most services today can't degrade gracefully. For example, when a limit of 10GB of RAM is reached, requests are rejected until there's free RAM; that fails only the small percentage of traffic proportional to the resources it requires, which is graceful (a maxmemory sketch is below). Garbage collection problems are not graceful: traffic just gets dropped on the floor, and this problem affects a lot of teams in a lot of companies every day.
  5. Push computation to the data.
    If you look at relative network speeds, CPU speeds, and disk speeds, it makes sense to do computation before going to disk and before going to the network. An example is summarizing logs on a node before they are pushed to a centralized monitoring service. Lua in Redis is another way to apply computation close to the data (a small example follows this list).
  6. Lua is not production ready in Redis today.
    On-demand scripting means service providers can't guarantee their SLA: a loaded script can do anything, and what service provider would want to risk blowing their SLA because of someone else's code? A deployment model would be better. It would allow for code review and benchmarking, so resource usage and performance could be properly calculated (a deploy-then-call sketch is below).
  7. Redis as the next high performance stream processing platform.
    It has pub-sub and scripting. Why not? (A minimal pub/sub sketch closes the post.)
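
A few sketches to make the points above concrete. First, the tail-latency point: even if each shard is slow only rarely, a query that fans out to many shards hits a slow one surprisingly often. A minimal Python sketch with made-up numbers:

```python
# Probability that a fan-out query is slow, assuming each shard is
# independently slow with probability p. Illustrative numbers only.
def p_query_slow(p_shard_slow: float, num_shards: int) -> float:
    # The whole query is slow if at least one shard is slow.
    return 1.0 - (1.0 - p_shard_slow) ** num_shards

for shards in (1, 10, 100):
    print(shards, round(p_query_slow(0.01, shards), 3))
# 1    0.01   ->  1% of queries slow
# 10   0.096  -> ~10% of queries slow
# 100  0.634  -> ~63% of queries slow
```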
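
Next, the oversupply arithmetic behind the deterministic-configuration point. This is a toy calculation: the fragmentation buffer, connection counts, and per-connection cost below are assumptions for illustration, not Twitter's real numbers.

```python
# Toy sizing for a Redis container under a Mesos-style hard memory limit:
# request enough headroom that fragmentation and connection spikes don't
# get the job killed. All constants below are illustrative assumptions.
DATA_GB = 8.0            # memory the data set actually needs
FRAG_BUFFER = 0.20       # expect <= 5% fragmentation, budget 20% to be safe
CONN_EXPECTED = 5_000    # connections we think we'll get per host
CONN_BUDGETED = 10_000   # connections we provision for, just in case
CONN_COST_MB = 0.05      # assumed per-connection overhead (~50 KB)

request_gb = DATA_GB * (1 + FRAG_BUFFER) + CONN_BUDGETED * CONN_COST_MB / 1024
likely_gb = DATA_GB * 1.05 + CONN_EXPECTED * CONN_COST_MB / 1024

print(f"request {request_gb:.2f} GB, likely use {likely_gb:.2f} GB, "
      f"waste {request_gb - likely_gb:.2f} GB")
```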
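
The graceful-degradation example (reject writes when a RAM limit is hit, keep serving everything else) is something Redis can already be configured to do. A sketch using the redis-py client against a local instance:

```python
# Sketch: cap Redis at 10 GB and reject writes (rather than evict or crash)
# once the limit is reached. Assumes a local Redis and the redis-py client.
import redis

r = redis.Redis(host="localhost", port=6379)

# With maxmemory + noeviction, only writes that need new memory fail;
# reads keep working, which is the "graceful" failure mode described above.
r.config_set("maxmemory", "10gb")
r.config_set("maxmemory-policy", "noeviction")

try:
    r.set("some:key", "value")
    print("write accepted")
except redis.exceptions.ResponseError as err:
    # e.g. "OOM command not allowed when used memory > 'maxmemory'"
    print("write rejected, degrading gracefully:", err)
```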
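
For pushing computation to the data, a small Lua script run inside Redis can aggregate values server-side so only the result crosses the network. A minimal redis-py sketch; the key names are made up:

```python
# Sketch: sum a handful of counters inside Redis with a Lua script so only
# the aggregate comes back over the network. Key names are illustrative.
import redis

r = redis.Redis(host="localhost", port=6379)
r.mset({"hits:node1": 10, "hits:node2": 25, "hits:node3": 7})

SUM_SCRIPT = """
local total = 0
for _, key in ipairs(KEYS) do
  total = total + tonumber(redis.call('GET', key) or 0)
end
return total
"""

total = r.eval(SUM_SCRIPT, 3, "hits:node1", "hits:node2", "hits:node3")
print(total)  # 42
```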
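
On the Lua deployment point, one way to approximate "deploy, then call" with what Redis already ships is to load reviewed scripts at deploy time and have clients invoke them only by SHA. A sketch; the rate-limit script itself is just an example:

```python
# Sketch: pre-load a reviewed script and call it by SHA instead of sending
# arbitrary on-demand Lua. The rate-limit script here is illustrative.
import redis

r = redis.Redis(host="localhost", port=6379)

RATE_LIMIT = """
local current = redis.call('INCR', KEYS[1])
if current == 1 then
  redis.call('EXPIRE', KEYS[1], ARGV[1])
end
return current
"""

# Done once at deploy time, after code review and benchmarking.
sha = r.script_load(RATE_LIMIT)

# Done by clients at runtime: only the approved SHA is invoked.
count = r.evalsha(sha, 1, "rate:user:42", 60)
print("requests in current window:", count)
```

This doesn't fully solve the SLA concern the article raises, since the server still runs whatever was loaded, but it separates review and deployment from runtime invocation.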
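
Finally, the stream-processing idea: the pub-sub plus scripting building blocks are already there. A toy consumer with redis-py; the channel name and payload are made up:

```python
# Sketch: Redis pub/sub as a minimal stream-processing pipe. A real pipeline
# would layer scripting, persistence, and consumer management on top.
import redis

r = redis.Redis(host="localhost", port=6379)

pubsub = r.pubsub(ignore_subscribe_messages=True)
pubsub.subscribe("events")            # channel name is illustrative

r.publish("events", "user:42 clicked")

for message in pubsub.listen():       # blocks, handling events as they arrive
    print("processing:", message["data"])
    break                             # stop after one event for the demo
```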